video representation
HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model
Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) their visual reasoning departs from how humans naturally perceive scenes from a first-person perspective, which makes the reasoning hard to interpret; and (2) learning is limited in capturing the inherent fine-grained relationships between the two modalities. In this paper, we take inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationships for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understanding. This comprises three alignment types: video-narration, noun-entity, and verb-entities alignment. Our method demonstrates strong interpretability in both quantitative and qualitative experiments, while maintaining competitive performance on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query. Project page: https://uark-aicv.github.io/HENASY
Unsupervised Learning of View-invariant Action Representations
The recent success in human action recognition with deep learning methods mostly adopts the supervised learning paradigm, which requires a significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework, which exploits unlabeled data to learn video representations. Different from previous works in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using the video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics that are discriminative for the action. In addition, we propose a view-adversarial training method to enhance the learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.
Appendices
Note that ppos is task-specific; here we use the class oracle, i.e. the ImageNet-100 labels, to define the positive samples. In Figure 1, we plot the proxy task performance, i.e. the percentage of queries where the key is ranked over all negatives, across training for MoCo [19], MoCo-v2 [10] and some variants in between. As mentioned above, all results in Figure 1 are for the same τ = 0.2. Ablations showed that this yields at best performance as good as mixing with the query, but on average about 0.1-0.2% lower. This weighing scheme also resulted in slightly inferior results.
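The proxy-task metric described above can be illustrated with a minimal sketch. It assumes the similarity between each query and its key, and between each query and its negatives, has already been computed; the function name and the list-based inputs are hypothetical and not taken from the source.

```python
from typing import Sequence


def proxy_task_accuracy(
    key_sims: Sequence[float],
    negative_sims: Sequence[Sequence[float]],
) -> float:
    """Percentage of queries whose key is ranked above all of its negatives.

    key_sims[i]      : similarity between query i and its key (assumed given).
    negative_sims[i] : similarities between query i and each of its negatives.
    """
    if len(key_sims) != len(negative_sims):
        raise ValueError("one list of negatives is required per query")
    if not key_sims:
        raise ValueError("at least one query is required")

    # A query counts as a hit only if its key strictly beats every negative.
    hits = sum(
        1
        for key_sim, negs in zip(key_sims, negative_sims)
        if all(key_sim > neg for neg in negs)
    )
    return 100.0 * hits / len(key_sims)


if __name__ == "__main__":
    # Query 0 ranks its key first; query 1 has a negative scoring higher.
    print(proxy_task_accuracy([0.9, 0.2], [[0.1, 0.3], [0.4, 0.1]]))
```

Ties are counted as misses in this sketch; whether the source treats ties differently is not stated.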
Self-supervised Co-training for Video Representation Learning
We show that the answer is no, in two respects: First, we show that hard positives are being neglected in the self-supervised training, and that if these hard positives are included then the quality of learnt representation improves significantly. To investigate this, we conduct an oracle experiment where positive samples are incorporated into the instance-based training process based on the semantic class label.
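A minimal sketch of how class labels could define the positive set in an oracle experiment of this kind. It uses one common multi-positive form of the InfoNCE loss, in which the numerator sums over every key that shares the query's label. The exact loss is not given in this excerpt, so the formulation, the function names, and the temperature default are all assumptions.

```python
import math
from typing import Sequence


def _logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def oracle_nce_loss(
    sims: Sequence[float],
    key_labels: Sequence[int],
    query_label: int,
    temperature: float = 0.1,  # hypothetical default, not from the source
) -> float:
    """Multi-positive InfoNCE for a single query (assumed formulation).

    sims[i] is the similarity between the query and key i. Keys whose class
    label equals query_label are positives; all remaining keys are negatives.
    """
    if len(sims) != len(key_labels):
        raise ValueError("sims and key_labels must have the same length")

    logits = [s / temperature for s in sims]
    positives = [l for l, y in zip(logits, key_labels) if y == query_label]
    if not positives:
        raise ValueError("the query needs at least one positive key")

    # -log( sum over positives exp(logit) / sum over all keys exp(logit) )
    return _logsumexp(logits) - _logsumexp(positives)


if __name__ == "__main__":
    # Two positives (label 1) and two negatives (label 0).
    print(oracle_nce_loss([0.8, 0.6, 0.1, -0.2], [1, 1, 0, 0], query_label=1))
```

With a single positive per query this reduces to the usual instance-based InfoNCE, so the class labels only change which keys enter the numerator.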